    When the signal is in the noise: Exploiting Diffix's Sticky Noise

    Get PDF
    Anonymized data is highly valuable to both businesses and researchers. A large body of research has, however, shown the strong limits of the de-identification release-and-forget model, where data is anonymized and shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of "sticky noise", Diffix has recently been proposed as a novel query-based mechanism that, on its own, satisfies the EU Article 29 Working Party's definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing an unlimited number of queries. This paper presents a new class of noise-exploitation attacks, which exploit the noise added by the system to infer private information about individuals in the dataset. Our first, differential attack uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second, cloning attack uses dummy conditions that strongly affect the output of the query depending on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy, using a maximum of only 32 queries per user. Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent the inference of private attributes. We furthermore argue that Diffix alone fails to satisfy the Art. 29 WP's definition of anonymization. We conclude by discussing how non-provable privacy-preserving systems can be combined with fundamental security principles, such as defense-in-depth and auditability, to build practically useful anonymization systems without relying on differential privacy.
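
    The statistical core of the differential attack can be illustrated with a short, heavily simplified sketch. It assumes that the attack has already been reduced to a set of noisy samples, each equal to the victim's contribution (0 or 1) plus zero-mean Gaussian noise of known standard deviation; how the Diffix queries producing such samples are constructed is not shown, and the noise level and sample count below are assumptions.

        # Toy likelihood-ratio test between "victim contributes 0" and "contributes 1".
        # This is a simplified reconstruction, not the paper's full attack: the noise
        # model, its standard deviation, and the number of samples are assumptions.
        import numpy as np
        from scipy.stats import norm

        def log_likelihood_ratio(samples, sigma):
            """log P(samples | contribution = 1) - log P(samples | contribution = 0)."""
            return (norm.logpdf(samples, loc=1.0, scale=sigma).sum()
                    - norm.logpdf(samples, loc=0.0, scale=sigma).sum())

        rng = np.random.default_rng(0)
        sigma = 2.0                                       # assumed total noise level
        samples = 1.0 + rng.normal(scale=sigma, size=5)   # ground truth: victim has the attribute
        guess = "has attribute" if log_likelihood_ratio(samples, sigma) > 0 else "does not"
        print(guess)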

    Quantifying Surveillance in the Networked Age: Node-based Intrusions and Group Privacy

    Full text link
    From the "right to be left alone" to the "right to selective disclosure", privacy has long been thought as the control individuals have over the information they share and reveal about themselves. However, in a world that is more connected than ever, the choices of the people we interact with increasingly affect our privacy. This forces us to rethink our definition of privacy. We here formalize and study, as local and global node- and edge-observability, Bloustein's concept of group privacy. We prove edge-observability to be independent of the graph structure, while node-observability depends only on the degree distribution of the graph. We show on synthetic datasets that, for attacks spanning several hops such as those implemented by social networks and current US laws, the presence of hubs increases node-observability while a high clustering coefficient decreases it, at fixed density. We then study the edge-observability of a large real-world mobile phone dataset over a month and show that, even under the restricted two-hops rule, compromising as little as 1% of the nodes leads to observing up to 46% of all communications in the network. More worrisome, we also show that on average 36\% of each person's communications would be locally edge-observable under the same rule. Finally, we use real sensing data to show how people living in cities are vulnerable to distributed node-observability attacks. Using a smartphone app to compromise 1\% of the population, an attacker could monitor the location of more than half of London's population. Taken together, our results show that the current individual-centric approach to privacy and data protection does not encompass the realities of modern life. This makes us---as a society---vulnerable to large-scale surveillance attacks which we need to develop protections against

    M²M: A general method to perform various data analysis tasks from a differentially private sketch

    Full text link
    Differential privacy is the standard privacy definition for performing analyses over sensitive data. Yet, its privacy budget bounds the number of tasks an analyst can perform with reasonable accuracy, which makes it challenging to deploy in practice. This can be alleviated by private sketching, where the dataset is compressed into a single noisy sketch vector which can be shared with the analysts and used to perform arbitrarily many analyses. However, the algorithms to perform specific tasks from sketches must be developed on a case-by-case basis, which is a major impediment to their use. In this paper, we introduce the generic moment-to-moment (M²M) method to perform a wide range of data exploration tasks from a single private sketch. Among other things, this method can be used to estimate empirical moments of attributes, the covariance matrix, counting queries (including histograms), and regression models. Our method treats the sketching mechanism as a black-box operation, and can thus be applied to a wide variety of sketches from the literature, widening their range of applications without further engineering or privacy loss, and removing some of the technical barriers to the wider adoption of sketches for data exploration under differential privacy. We validate our method with data exploration tasks on artificial and real-world data, and show that it can be used to reliably estimate statistics and train classification models from private sketches. Comment: Published at the 18th International Workshop on Security and Trust Management (STM 2022).
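
    The "many analyses from one sketch" idea can be illustrated with a minimal example. The snippet below is not the M²M method itself: it simply builds one noisy vector of feature averages (a hand-picked set of moments and an indicator) and reads several statistics off it; the feature map, noise scale, and toy data are assumptions.

        # Minimal illustration (not the paper's M^2M algorithm): several statistics
        # estimated from a single noisy sketch of feature averages. The feature map,
        # noise scale, and toy data are assumptions made for the example.
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(50_000, 2))        # toy dataset with two attributes

        def feature_map(x):
            # hand-picked "generalized moments": raw moments, a cross term, an indicator
            return np.array([x[0], x[1], x[0] ** 2, x[0] * x[1], float(x[1] > 0.0)])

        sketch = np.mean([feature_map(x) for x in X], axis=0)
        noisy = sketch + rng.normal(scale=1e-3, size=sketch.shape)  # privacy noise (placeholder scale)

        mean_x0 = noisy[0]                          # E[x0]
        var_x0 = noisy[2] - noisy[0] ** 2           # Var[x0] = E[x0^2] - E[x0]^2
        cov_x0_x1 = noisy[3] - noisy[0] * noisy[1]  # Cov[x0, x1]
        frac_x1_pos = noisy[4]                      # counting query: share of records with x1 > 0
        print(mean_x0, var_x0, cov_x0_x1, frac_x1_pos)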

    Synthetic Data – what, why and how?

    Full text link
    This explainer document aims to provide an overview of the current state of the rapidly expanding work on synthetic data technologies, with a particular focus on privacy. The article is intended for a non-technical audience, though some formal definitions have been given to provide clarity to specialists. It should enable the reader to quickly become familiar with the notion of synthetic data, as well as to understand some of the subtle intricacies that come with it. We believe that synthetic data is a very useful tool, and our hope is that this report highlights that, while drawing attention to nuances that can easily be overlooked in its deployment. Comment: Commissioned by the Royal Society. 57 pages, 2 figures.

    A Framework for Auditable Synthetic Data Generation

    Full text link
    Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over what statistical patterns are captured, leading to concerns over privacy protection. While synthetic records are not linked to a particular real-world individual, they can reveal information about users indirectly, which may be unacceptable for data owners. There is thus a need to empirically verify the privacy of synthetic data, a particularly challenging task in high-dimensional data. In this paper, we present a general framework for synthetic data generation that gives data controllers full control over which statistical properties the synthetic data ought to preserve, what exact information loss is acceptable, and how to quantify it. The benefits of the approach are that (1) one can generate synthetic data that results in high utility for a given task, while (2) empirically validating that only statistics considered safe by the data curator are used to generate the data. We thus show the potential for synthetic data to be an effective means of releasing confidential data safely, while retaining useful information for analysts.
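
    The auditability idea, generation driven only by an explicit list of approved statistics, can be sketched in a few lines. The snippet below is not the paper's framework: it uses independent one-way marginals as the sole "safe" statistics, and the column names and data are invented for illustration.

        # Toy sketch: synthetic data generated only from an explicit, auditable set of
        # statistics (independent one-way marginals). Not the paper's framework; the
        # columns, data, and choice of statistics are assumptions for illustration.
        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(0)
        real = pd.DataFrame({
            "age_band": rng.choice(["18-30", "31-50", "51+"], size=1_000, p=[0.3, 0.5, 0.2]),
            "region": rng.choice(["north", "south", "east", "west"], size=1_000),
        })

        # 1. The data controller decides which statistics may leave the raw data
        #    (noise could be added to them here before release).
        allowed_stats = {col: real[col].value_counts(normalize=True) for col in real.columns}

        # 2. The generator only ever sees `allowed_stats`, never the raw records.
        def generate(stats, n):
            return pd.DataFrame({
                col: rng.choice(p.index.to_numpy(), size=n, p=p.to_numpy())
                for col, p in stats.items()
            })

        synthetic = generate(allowed_stats, n=1_000)
        print(synthetic.head())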

    Compressive Learning with Privacy Guarantees

    Get PDF
    This work addresses the problem of learning from large collections of data with privacy guarantees. The compressive learning framework proposes to deal with the large scale of datasets by compressing them into a single vector of generalized random moments, from which the learning task is then performed. We show that a simple perturbation of this mechanism with additive noise is sufficient to satisfy differential privacy, a well-established formalism for defining and quantifying the privacy of a random mechanism. We combine this with a feature subsampling mechanism, which reduces the computational cost without damaging privacy. The framework is applied to the tasks of Gaussian modeling, k-means clustering and principal component analysis (PCA), for which sharp privacy bounds are derived. Empirically, the quality (for subsequent learning) of the compressed representation produced by our mechanism is strongly related to the induced noise level, for which we give analytical expressions.
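
    A minimal version of the mechanism described, a random-Fourier-feature sketch perturbed with additive Gaussian noise, can be written as follows. The noise calibration shown is the standard Gaussian mechanism under replace-one neighbouring datasets, not the sharper bounds derived in the paper, and the dimensions and privacy parameters are assumptions.

        # Minimal compressive-learning-style sketch with additive Gaussian noise.
        # The noise is calibrated with the generic Gaussian mechanism (replace-one
        # neighbours), not the paper's sharper analysis; sizes and (eps, delta) are
        # assumptions for the example.
        import numpy as np

        rng = np.random.default_rng(0)
        n, d, m = 10_000, 5, 256
        X = rng.normal(size=(n, d))                 # sensitive dataset

        W = rng.normal(size=(d, m))                 # random frequencies

        def phi(X):
            # per-record feature with unit L2 norm (cos/sin pairs scaled by 1/sqrt(m))
            Z = X @ W
            return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

        sketch = phi(X).mean(axis=0)                # vector of generalized random moments

        # Replacing one record changes the mean by at most 2/n in L2 norm.
        eps, delta = 1.0, 1e-6
        sensitivity = 2.0 / n
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        private_sketch = sketch + rng.normal(scale=sigma, size=sketch.shape)
        print(private_sketch.shape, sigma)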

    Theoretical models for web search privacy through query obfuscation

    No full text
    With the emergence of the Big Data era, privacy has become an increasingly important issue. The constant and ubiquitous logging of personal and professional data raises concerns, as this data is used for commercial, political or judicial purposes, with little to no regard for the users' privacy. In particular, Web search – the activity through which users of search engines access information on the Internet from their search queries – has recently come to light as an area where privacy is both essential and, as of now, unachievable. Indeed, Web search data is seen as extremely intimate, as it may contain commercial, financial and medical information, yet very few solutions exist to protect its privacy. A promising solution that has been proposed is query obfuscation, where a program on the user's computer sends many artificial queries in the hope of drowning the user's queries in noise. This approach is valuable, as it makes the user solely responsible for her own privacy, and additionally ensures protection against an eavesdropper. However, no obfuscator developed so far has been proven to address the privacy issues of Web search data in a meaningful way, and existing implementations have been shown to be either unusable or useless in practice. Assessing whether efficient and effective obfuscators can be designed is a crucial question for the future of Web search privacy. In this master thesis, we propose a novel framework for the analysis and design of query obfuscators. Our contributions are fourfold. Firstly, we analyze the literature and discuss the user's needs to define design principles for obfuscators. Secondly, we define three novel privacy notions that answer these needs. Thirdly, we introduce a new model for practical obfuscators that implements the principles discussed. Fourthly, we build on this model and these notions to discuss the feasibility of query obfuscation for Web search. Our conclusion is that query obfuscation is not a suitable solution for Web search privacy, but that it is nonetheless a surprisingly valuable tool: query obfuscation is a powerful technique, yet it is inadequate for the daunting task of protecting Web search data, due to the sheer volume of data involved. We argue that the rigorous analysis proposed in this master thesis serves as a strong – and arguably the first – basis for the study of obfuscators as a solution to privacy issues in other domains, such as the privacy of patent search. Master [120]: ingénieur civil en mathématiques appliquées, Université catholique de Louvain, 201

    Web Privacy: A Formal Adversarial Model for Query Obfuscation

    No full text
    The queries we perform, the searches we make, and the websites we visit: this sensitive data is collected at scale by companies as part of the services they provide. Query obfuscation, intertwining the user's queries with artificial queries, has been proposed as a solution to protect the privacy of individuals on the web. We here present a formal model and formulate, through attack models, three privacy requirements for obfuscators: 1) indistinguishability, that the user query should be hard to identify; 2) coverage, that its topic should be hard to identify; and 3) imprecision, that the query should still be hard to identify for an attacker with additional auxiliary information. The latter is needed to make the former two guarantees "future-proof". Using our framework, we derive two important results for obfuscators. First, we show that indistinguishability imposes strong bounds on the coverage and imprecision achievable by an obfuscator. Second, we prove an important tradeoff between coverage and imprecision, which inherently limits the strength and robustness of the privacy guarantees that an obfuscator can provide. We then introduce a family of obfuscators with provable indistinguishability guarantees, which we call k-ball obfuscators, and show, for a range of parameter values, the achievable coverage and imprecision. We show empirically that our theoretical tradeoff holds, and that its bound is not tight in practice: even in a simple idealized setting, there is a significant gap between the coverage and imprecision guarantees achievable in practice and the optimal bounds. While obfuscators have proven popular with the general public, all obfuscators currently available provide ad-hoc guarantees, and have been shown to be vulnerable to attacks, putting the data of users at risk. We hope this work will be a first step towards a robust evaluation of the properties of query obfuscators and the development of principled obfuscators.
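
    A toy simulation can make the basic indistinguishability requirement concrete. The snippet below is not the paper's formal model nor its k-ball construction: it only shows that an obfuscator hiding each user query among k-1 dummies drawn from the same pool limits an adversary with no auxiliary information to roughly a 1/k chance of picking out the real query; the query pool and adversary are invented for illustration.

        # Toy illustration of indistinguishability for a dummy-query obfuscator.
        # Not the paper's formal model or k-ball construction; the query pool,
        # obfuscator, and adversary are invented for illustration.
        import random

        random.seed(0)
        POOL = [f"query_{i}" for i in range(1_000)]   # hypothetical query universe

        def obfuscate(user_query, k=8):
            """Emit the user query hidden among k - 1 dummies, in random order."""
            dummies = random.sample([q for q in POOL if q != user_query], k - 1)
            batch = dummies + [user_query]
            random.shuffle(batch)
            return batch

        def adversary_guess(batch):
            # Without auxiliary information, every query in the batch looks alike.
            return random.choice(batch)

        trials, hits = 5_000, 0
        for _ in range(trials):
            q = random.choice(POOL)
            if adversary_guess(obfuscate(q, k=8)) == q:
                hits += 1
        print(f"adversary accuracy: {hits / trials:.3f} (1/k = {1 / 8:.3f})")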

    TAPAS: a toolbox for adversarial privacy auditing of synthetic data

    No full text
    Personal data collected at scale promises to improve decision-making and accelerate innovation. However, sharing and using such data raises serious privacy concerns. A promising solution is to produce synthetic data, artificial records to share instead of real data. Since synthetic records are not linked to real persons, this intuitively prevents classical re-identification attacks. However, this is insufficient to protect privacy. We here present TAPAS, a toolbox of attacks to evaluate synthetic data privacy under a wide range of scenarios. These attacks include generalizations of prior works as well as novel attacks. We also introduce a general framework for reasoning about privacy threats to synthetic data and showcase TAPAS on several examples.
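
    TAPAS's own API is not reproduced here; the snippet below is a generic sketch of one simple attack of the kind such toolboxes implement, a membership test based on the distance from a target record to its closest synthetic record, with the data, "generator", and metric all invented for illustration.

        # Generic distance-to-closest-record membership test on synthetic data.
        # This is NOT the TAPAS API; the toy data, leaky "generator", and metric
        # are assumptions made for illustration only.
        import numpy as np

        rng = np.random.default_rng(0)

        def dcr(target, synthetic):
            """Euclidean distance from a target record to its closest synthetic record."""
            return np.min(np.linalg.norm(synthetic - target, axis=1))

        train = rng.normal(size=(1_000, 5))                            # records used for training
        synthetic = train + rng.normal(scale=0.05, size=train.shape)   # deliberately leaky generator
        member, non_member = train[0], rng.normal(size=5)

        # A much smaller distance for the member than for typical non-members
        # indicates that the synthetic data leaks information about training records.
        print("DCR(member)     =", dcr(member, synthetic))
        print("DCR(non-member) =", dcr(non_member, synthetic))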